In many risk-aware and multi-objective reinforcement learning settings, the utility of the user is derived from a single execution of a policy. In these settings, making decisions based on the average future returns is not suitable. For example, in a medical setting a patient may only have one opportunity to treat their illness. Making decisions using just the expected future returns -- known in reinforcement learning as the value -- cannot account for the potential range of adverse or positive outcomes a decision may have. Therefore, the distribution over future returns, considered together with the returns already accrued, should be used to represent the critical information the agent requires at decision time. In this paper, we propose two novel Monte Carlo tree search algorithms. Firstly, we present a Monte Carlo tree search algorithm that can compute policies for nonlinear utility functions (NLU-MCTS) by optimising the utility of the different possible returns attainable from individual policy executions, resulting in good policies for both risk-aware and multi-objective settings. Secondly, we propose a distributional Monte Carlo tree search algorithm (DMCTS) that extends NLU-MCTS. DMCTS computes an approximate posterior distribution over the utility of the returns and utilises Thompson sampling during planning to compute policies in risk-aware and multi-objective settings. Both algorithms outperform the state of the art in multi-objective reinforcement learning for the expected utility of the returns.
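A minimal formalisation of this idea (the notation here is assumed for illustration, not taken verbatim from the paper): writing $\mathbf{R}^-_t$ for the return accrued up to timestep $t$, $\mathbf{R}^+_t$ for the random future return, and $u$ for the user's possibly nonlinear utility function, the agent should select actions that maximise the expected utility of the full return of a single execution rather than the utility applied to the expected return,

\[
\mathbb{E}\big[\, u(\mathbf{R}^-_t + \mathbf{R}^+_t) \,\big]
\quad \text{rather than} \quad
u\big(\mathbf{R}^-_t + \mathbb{E}[\mathbf{R}^+_t]\big),
\]

and the two objectives coincide only when $u$ is linear; for risk-aware or nonlinear multi-objective utilities the shape of the distribution of $\mathbf{R}^+_t$ matters, which is the information NLU-MCTS and DMCTS plan over.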
Many real-world problems contain multiple objectives and agents, where trade-offs exist between the objectives. Key to solving such problems is exploiting the sparse dependency structure that exists between agents. For example, in wind farm control a trade-off exists between maximising power output and minimising stress on the system's components, and the dependencies between turbines arise from the wake effect. We model such sparse dependencies as a multi-objective coordination graph (MO-CoG). In multi-objective reinforcement learning, a utility function is typically used to model a user's preferences over objectives, which may be unknown a priori. In such settings, a set of optimal policies must be computed. Which policies are optimal depends on which optimality criterion applies. If the utility function of a user is derived from multiple executions of a policy, the scalarised expected returns (SER) criterion must be optimised. If the utility of a user is derived from a single execution of a policy, the expected scalarised returns (ESR) criterion must be optimised. For example, wind farms are subject to constraints and regulations that must be adhered to at all times, so the ESR criterion must be optimised. For MO-CoGs, the state-of-the-art algorithms can only compute a set of optimal policies for the SER criterion, leaving the ESR criterion understudied. To compute a set of optimal policies under the ESR criterion, also known as the ESR set, distributions over the returns must be maintained. Therefore, to compute a set of optimal policies under the ESR criterion for MO-CoGs, we propose a novel distributional multi-objective variable elimination (DMOVE) algorithm. We evaluate DMOVE in realistic wind farm simulations. Given that the returns in real-world wind farm settings are continuous, we use a model known as real-NVP to learn the continuous return distributions with which the ESR set is computed.
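As a toy illustration of why the ESR criterion forces distributions over the returns to be maintained, the sketch below compares two hypothetical policies under a made-up nonlinear utility (a power term minus a quadratic stress penalty); it is not the DMOVE algorithm or the paper's wind farm model, and all numbers and names are assumptions:

```python
# Toy SER vs ESR comparison (illustrative only, not DMOVE): with a nonlinear
# utility, the utility of the expected return (SER) and the expected utility
# of the return (ESR) can rank policies differently, so expected returns alone
# are not enough under the ESR criterion.
import numpy as np

def utility(returns):
    # Hypothetical nonlinear utility over a 2-objective return vector:
    # reward power produced, quadratically penalise component stress.
    power, stress = returns[..., 0], returns[..., 1]
    return power - 0.1 * stress**2

rng = np.random.default_rng(0)

# Empirical return distributions of two candidate policies, each sample being
# the 2-objective return of a single policy execution (made-up numbers).
policy_a = rng.normal(loc=[10.0, 2.0], scale=[1.0, 0.5], size=(10_000, 2))
policy_b = rng.normal(loc=[11.0, 2.0], scale=[1.0, 4.0], size=(10_000, 2))

for name, samples in [("A", policy_a), ("B", policy_b)]:
    ser = utility(samples.mean(axis=0))  # utility of the expected return
    esr = utility(samples).mean()        # expected utility of the return
    print(f"policy {name}: SER = {ser:.2f}, ESR = {esr:.2f}")

# Policy B produces more power on average and so looks better under SER, but
# its far noisier stress makes it worse under ESR; recovering that ordering
# requires the return distribution, not just its mean.
```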
In many real-world scenarios, the utility of a user is derived from a single execution of a policy. In this case, to apply multi-objective reinforcement learning, the expected utility of the returns must be optimised. Various scenarios exist where a user's preferences over objectives, also known as the utility function, are unknown or difficult to specify. In such scenarios, a set of optimal policies must be learned. However, settings in which the expected utility must be maximised have largely been overlooked by the multi-objective reinforcement learning community and, as a result, a set of optimal solutions has yet to be defined. In this paper, we address this challenge by proposing first-order stochastic dominance as a criterion for building solution sets that maximise expected utility. We also propose a new dominance criterion, known as expected scalarised returns (ESR) dominance, which extends first-order stochastic dominance to allow a set of optimal policies to be learned in practice. We then define a new solution concept called the ESR set, which is the set of policies that are ESR dominant. Finally, we define a new multi-objective tabular distributional reinforcement learning (MOT-DRL) algorithm to learn the ESR set in a multi-objective multi-armed bandit setting.
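For readers unfamiliar with the dominance criteria named above, a brief sketch in standard notation (the multi-objective extension is stated only roughly; the paper's precise definition of ESR dominance should be preferred): for scalar returns, a policy $\pi$ first-order stochastically dominates $\pi'$ when

\[
P(R_\pi > v) \;\ge\; P(R_{\pi'} > v) \quad \text{for all } v,
\]

which implies $\mathbb{E}[u(R_\pi)] \ge \mathbb{E}[u(R_{\pi'})]$ for every monotonically increasing utility $u$, and therefore gives an ordering on expected utility even when the utility function is unknown. ESR dominance extends this idea to vector-valued returns, roughly by requiring $P(\mathbf{R}_\pi \ge \mathbf{v}) \ge P(\mathbf{R}_{\pi'} \ge \mathbf{v})$ for all return vectors $\mathbf{v}$, with the inequality between vectors taken component-wise, so that the ESR set described above can be learned without knowing $u$.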